Empirical Analysis of Cuckoo Hashing Insert Complexity

This notebook attempts to analyze the number of disk reads required for different formats of a hash table that uses the Cuckoo format for collison handling.

Disk reads are calculated theorectically by assuming that each time a (key, value) pair is moved, that this new section of our data structure will need to be read from disk storage.

Schema Information

Here we are getting the schema and setting up our data for the analysis

EDA

We want to understand some basic information about our dataset. This will be a breif analysis of number of rows and distribution

Descriptive Statistics

Basic Plots

Here we look at a couple of lots that show a 1% sample of our data in order to understand the results better. This is used to inform the direct of a regression or other metric for empirically defining the number of disk reads a Hash Table would require if it was not stored in main memory.

We see the number of disk reads required appears to grow exponentially with the load factor (also called the utilization ratio here). Additionally, the growth rate of this hypothetical function is different between the number of tables. This follows from the theoretical likelihood for the number of operations required for insertion given the load factor and number of tables.

Distribution Analysis

If we think about how we might express a function for the information found in this data, we should look at the log transform of the data in order to possibly have a linear regression; however, you will immediately notice that the data does not follow a linear pattern and that even a fitted regression to this data expresed by some unknown function would attempt to reduce the error (most commonly by RSS). Such a function would still poorly describe our data except by the fact that we have a concept of the expected case. Rather since we know our data structure has a worst-case ammoratized constant time search complexity, we we consider ourselves with a probablistic worst-case complexity by finding the 90%, 95%, and 99% thresholds for insertion complexity. To find the probablistic complexity we will look for a distribution that matches our underlying data.

If we consider the underlying events that the data is representing, we see that the number of disk reads that an insertion requires direct corresponds to the number of times that a (key, value) pair is inserted into a filled table slot. For the ideal hash function the maps to a uniform probability distribution across its range, our event is that the table slot fails to be empty which has a probability of $p(x) = 1 - \frac{\textrm{load factor}}{\textrm{number of tables} \ \cdot \ \textrm{length of tables}}$. This event is repeated every time that we have a theorectial disk read. Since we are repeating independent events with a technically infinite domain, the discrete probability distribution is either poisson (average number of events occuring for a load factor) or geometric (rate at which our event occurs for a load factor).

This is further supported by a histogram of number of disk reads for a load factor of 50% on two tables. We also consider these two distributions for the case with two tables.

Remove outliers

If we bin the data we find that most of our points fall into the very low range. We want to generate the average of our data so that we can find the theoretical distribution without having the outliers skew the data too much. We consider the poisson distribution first which is defined by $p(x, \lambda) = \frac{e^{-\lambda}\lambda^x}{x!}$, where $\lambda = \mu = \sigma^2$. Thus we will plot the distribution using the $\lambda$ derived from the mean and standard deviation considering data points less than the outlier cutoffs. The geometric distribution is defined by $p(x, r) = (1 - r)^x(r)$, where $r = \frac{1}{1+\mu}$.

The outlier thresholds that we consider are 2, 3, and 5.

Goodness of Fit tests for Poisson and Geometric distributions

Here we want to identify (across more items than we can look at graphically) whether or not our underlying distribution is poisson or geometric. This will involve a couple of statistical tests. If we are unable to statistically support our assumption of a geometric distribution, we will have to assume this is the theorectical distribution based solely on our reasoning of the problem.

Useful Articles:

A modified version of the Smooth Goodness of Fit Test $S_1$ (Best and Rayner) had the highest power for testing if the distribution is Geometric or Poisson (ÖZonur et al.) - first article above. This is defined by:

$C(n, k) = k - \textrm{combinations of} \ n \ \textrm{elements} = \frac{n!}{(n-k)!k!}$

$a = \frac{q}{1-q}$

$K = \frac{1}{r!(a^2 + a)^{r/2})}$

$h_r(X_j, \hat{q}) = K\Sigma_{i=0}^{r} C(r-i, x) * (C(i, r))^2 * i! (r-i)! (-a)i$

$h_r$ is a special case of Meixner polynomials defined by a recurrance relation in Rayer and Best (1989). We need not concern ourseleves too much with generation of these polynomials since the modified version uses only $S_1 = U^2_2$ meaning we only need $h_2(X_j, \hat{q})$

$h_2(X_j, \hat{q}) = x(x-1)-4ax+2a^2$

$U_r = \Sigma_{j=1}^n h_r(X_j, \hat{q}) / \sqrt{n}$

$S_c = U^2_2 + \cdots + U^2_{c+1}$

$S^*_1 = nS_1 / \Sigma_{j=1}^{n} h^2_2(X_j, \hat{q})$

If $S^*_1 > \chi_1^2$ then $H_0$ is rejected.

Thus, all of these together give us the large formula below

\begin{equation} S_1^* = \frac{n \Sigma_{j=1}^{n} h_r(X_j, \hat{q}) / \sqrt{n}}{\Sigma_{j=1}^{n} h_2^2(X_j, \hat{q})} \end{equation}

We want to now calculate this test for geometric goodness of fit for each percentile of load factor in our data. This will be calculating the chi-squared test and the defined $S_1^*$ test above for each load factor percentile and number of tables (100 * 3 = 300 times). We will see what passes our assumption of geometric, and what does not.

We will also test if the data passes a test to be poisson distributed. This will be the index of dispersion (or Variance to Mean Ratio) test (https://www.itl.nist.gov/div898/software/dataplot/refman2/auxillar/ind_disp.htm)

This found the that the index of dispersion shows a ratio of variance to mean, and if data is Poisson then this ration should be 1. If the data is geometric, then the dispersion is >1.

\begin{equation} I = \Sigma_{i=1}^{N} \frac{(X_i - \bar{X})^2}{\bar{X}} = \frac{\sigma^2}{\bar{x}} \end{equation}

Bootstrapping an Accurate Representation of Our Data

Here we randomly sample our data multiple times in order to find a bootstrapped average to use in our theorectical distribution.

We want to find the mean for number of reads at each of the percentiles this will be done for each combination of hashes. The best hash combination will be the one with the lowest average mean num reads.

Let's look at the best combination of the hash functions for our cuckoo hashing based on the lowest mean number of reads. We look at the lower and upper bounds for our confidence intervals for the mean number of reads below the theoretical thresholds of 50% for two tables, 91% for three tables, and 97% for four tables (http://www.eecs.harvard.edu/~michaelm/postscripts/esa2009.pdf, https://hal.inria.fr/hal-01184689/document)